feat(parser): adopt section-number module in refs/inference/text parsers by thewrz · Pull Request #115 · wrzonance/SpecR

thewrz · 2026-06-06T18:48:33Z

Summary

Stacked on #114. Fixes the silent suffix-truncation data corruption across all content-scanning parsers by adopting `sectionNumberFragment()` / `normalizeSectionNumber()`:

- `parser/refs` (prose citations): `See Section 26 00 13.10` no longer stores a wrong link to base `26 00 13`; the suffix is captured and normalized.
- `lib/infer-section`: the keyword scan keeps `.33` (`01 33 23.33` ≠ `01 33 23`), bare suffixed headers are inferred instead of returning none, inline titles after suffixed numbers are extracted again, and dash-separated inline titles now have the dash stripped (parity with the text parser).
- `parser/text`: `.txt` headers keep suffixed sections and their dash titles (previously `SECTION 27 05 13.43 - TITLE` truncated the section and lost the title).
- `parser/sec`: `SCN`/`SRF` values are normalized or kept verbatim: whitespace/NBSP dirt is canonicalized and tagged values are never rejected (3 corpus files carry `SCN` dirt). Internal-dirt normalization is pinned by mutation testing.
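
The truncation described above can be illustrated with a minimal sketch. The two helpers below are hypothetical stand-ins for the section-number module, not its actual code: the regex shape is an assumption, agency-suffixed numbers are left out, and `truncatingRule`, `fragmentRule`, and `citedSection` are names invented for the example.

```typescript
// Hypothetical stand-ins for the section-number module; the real
// sectionNumberFragment() / normalizeSectionNumber() may differ.
// Agency-suffixed forms are deliberately omitted here.
const NBSP = "\u00a0";

/** Regex source for "NN NN NN" with an optional ".NN" suffix (assumed shape). */
function sectionNumberFragment(): string {
  const sep = `[ \\t${NBSP}]+`;
  return `\\d{2}${sep}\\d{2}${sep}\\d{2}(?:\\.\\d{2})?`;
}

/** Collapse space/tab/NBSP runs to single spaces; null if the shape is wrong. */
function normalizeSectionNumber(raw: string): string | null {
  const collapsed = raw.replace(/[ \t\u00a0]+/g, " ").trim();
  return /^\d{2} \d{2} \d{2}(?:\.\d{2})?$/.test(collapsed) ? collapsed : null;
}

// Old-style rule: fixed 2-2-2 pattern, so a dotted suffix is silently dropped.
const truncatingRule = /Section\s+(\d{2} \d{2} \d{2})/;

// New-style rule: built from the shared fragment, so the suffix is captured.
const fragmentRule = new RegExp(`Section\\s+(${sectionNumberFragment()})`);

function citedSection(rule: RegExp, prose: string): string | null {
  const match = rule.exec(prose);
  if (!match) return null;
  const raw: string = match[1] ?? "";
  // Normalize, falling back to the trimmed raw capture.
  return normalizeSectionNumber(raw) ?? raw.trim();
}

const prose = "See Section 26 00 13.10 for details.";
console.log(citedSection(truncatingRule, prose)); // 26 00 13
console.log(citedSection(fragmentRule, prose)); // 26 00 13.10
```

Building every rule from one shared fragment means a change to the accepted number shape only has to be made in one place.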

Test Plan

pnpm test src/parser/ src/lib/infer-section.test.ts — regression tests named by symptom (e.g. 'infer-section: keyword scan keeps .33 — 01 33 23.33 is not 01 33 23')
pnpm lint && pnpm test — 564 unit tests green at this commit; pre-existing suffix pins (27 05 13.43) untouched

Out of Scope

This PR does NOT tighten any validation gate (API schemas/worker gate land in PR 3; DB CHECK in PR 4). Parser outputs for conforming inputs are unchanged; dirty inputs are strictly improved.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Enhanced section citation parsing to properly handle dotted suffixes (e.g., "Section 26 00 13.10") and agency-formatted section numbers.
- Improved whitespace normalization in section references, converting NBSP separators to canonical spacing for consistent matching.
Tests
- Expanded test coverage for edge cases in section number parsing and citation extraction.

coderabbitai · 2026-06-06T18:48:39Z

📝 Walkthrough

This PR systematizes section-number canonicalization across the parsing and inference pipeline by importing normalizeSectionNumber and sectionNumberFragment helpers and applying them consistently during section extraction in the inference layer, text parser, SEC parser, and reference extraction modules, including corresponding test coverage.

Changes

Section-number canonicalization pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Infer section layer normalization `src/lib/infer-section.ts`, `src/lib/infer-section.test.ts` | Build SECTION and bare-number regexes from `sectionNumberFragment()`. Normalize extracted sections via `normalizeSectionNumber()` in `scanKeyword` and `scanBareNumber`. Fix inline-title capture to use `inlineMatch[2]`. Tests verify decimal suffix preservation, agency-suffixed formats, inline-title extraction with dash/em-dash separators, and bare suffixed headers. |
| Text parser section normalization `src/parser/text/index.ts`, `src/parser/text/index.test.ts` | Build SECTION regex patterns from `sectionNumberFragment()`. Normalize extracted sections in `extractSectionMeta()` by passing first capture group through `normalizeSectionNumber()`, replacing multi-group assembly. Tests cover dash-delimited titles, agency-suffixed headers, bare suffixed header parsing, and verification that dashed headers are classified as metadata and not body text. |
| SEC parser section normalization `src/parser/sec/index.ts`, `src/parser/sec/index.test.ts` | Normalize `<SCN>` section numbers via `normalizeSectionNumber()` after entity decoding, with fallback to raw decoded value. Normalize `<SRF>` tag contents for `targetSpecSection` fields. Tests verify whitespace canonicalization for `SCN` variants, NBSP-to-canonical conversion for `SRF` targets, and preservation of unnormalizable content like `APPENDIX B`. |
| Reference extraction section normalization `src/parser/refs/extract.ts`, `src/parser/refs/extract.test.ts`, `src/parser/refs/rules.ts`, `src/parser/refs/rules.test.ts` | Normalize section numbers in `buildRef` by extracting from first capture group and applying `normalizeSectionNumber()` with fallback to trimmed raw. Update CSI section reference rule regex to use `sectionNumberFragment()` instead of fixed 2-2-2 numeric pattern. Tests verify dotted-suffix preservation (`26 00 13.10`), NBSP normalization, and agency-suffix extraction. |
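
As a rough illustration of two behaviours in the table, header extraction with a dash-separated title and the normalize-or-verbatim fallback for tagged values, here is a small sketch. It is not the repository's code: the regex shape, the `SectionMeta` type, and `normalizeOrVerbatim` are assumptions made for the example, `extractSectionMeta` here is a simplified stand-in for the function of the same name, and agency suffixes are not handled.

```typescript
// Assumed number shape: "NN NN NN" with optional ".NN"; separators may be
// spaces, tabs, or NBSP. Agency-suffixed numbers are not modelled here.
const SECTION_NUMBER = String.raw`\d{2}[ \t\u00a0]+\d{2}[ \t\u00a0]+\d{2}(?:\.\d{2})?`;

/** Hypothetical normalizer: collapse whitespace runs, null if the shape is wrong. */
function normalizeSectionNumber(raw: string): string | null {
  const collapsed = raw.replace(/[ \t\u00a0]+/g, " ").trim();
  return /^\d{2} \d{2} \d{2}(?:\.\d{2})?$/.test(collapsed) ? collapsed : null;
}

// "SECTION <number>", then an optional hyphen / en-dash / em-dash separator
// (or plain whitespace), then the inline title in capture group 2.
const HEADER = new RegExp(
  String.raw`^SECTION[ \t]+(${SECTION_NUMBER})(?:[ \t]*[-\u2013\u2014][ \t]*|[ \t]+)?(.*)$`,
  "i",
);

interface SectionMeta {
  section: string;
  title: string | null;
}

/** Simplified stand-in: one capture group for the number, one for the title. */
function extractSectionMeta(line: string): SectionMeta | null {
  const match = HEADER.exec(line.trim());
  if (!match) return null;
  const section = normalizeSectionNumber(match[1] ?? "");
  if (section === null) return null;
  const title = (match[2] ?? "").trim();
  return { section, title: title === "" ? null : title };
}

/** Normalize-or-verbatim: a tagged value is cleaned when possible, never rejected. */
function normalizeOrVerbatim(decoded: string): string {
  return normalizeSectionNumber(decoded) ?? decoded;
}

console.log(extractSectionMeta("SECTION 27 05 13.43 - TITLE"));
console.log(normalizeOrVerbatim("01\u00a033  23.33")); // 01 33 23.33
console.log(normalizeOrVerbatim("APPENDIX B")); // APPENDIX B
```

Taking the whole number from a single capture group, rather than assembling it from several, is what keeps the suffix from being dropped when the fragment grows.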

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

wrzonance/SpecR#62: Both PRs evolve section/title inference in the same area; #62 is the foundational work on src/lib/infer-section.ts section-number matching and normalization.
wrzonance/SpecR#76: Both PRs update src/parser/refs/extract.ts and src/parser/refs/rules.ts to normalize section-number construction in targetSpecSection and the CSI section reference rule.
wrzonance/SpecR#112: Both PRs modify src/parser/sec/index.ts's SCN and section-title parsing in parseSec, with different approaches to normalizing section numbers and handling entity decoding.

Poem

🐰 Hops through sections near and far,
Normalizes numbers, fixing what they are.
From XX XX XX to XX XX XX.YY,
Canonicalized sections reach the sky! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(parser): adopt section-number module in refs/inference/text parsers' clearly and specifically summarizes the main change: the adoption of section-number module functions across multiple parser components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

thewrz · 2026-06-06T19:15:59Z

@coderabbitai review

coderabbitai · 2026-06-06T19:16:03Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/infer-section.ts`:
- Line 18: The ReDoS concern is a false positive because sectionNumberFragment()
returns a fixed module-local constant FRAGMENT composed of fixed-width pieces
(e.g., \d{2}, optional \.\d{2}, optional single agency token) and is
interpolated into new RegExp(...) in infer-section.ts; update the code by adding
a concise inline comment next to sectionNumberFragment() and the new RegExp(...)
call explaining that FRAGMENT is fixed-width and contains no nested
unbounded/repeating groups, and optionally add a small unit/perf test for
infer-section.ts that runs the regex against long adversarial strings to prove
it doesn't exhibit catastrophic backtracking.
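
The optional perf check suggested above might look roughly like the following. This is only a sketch under assumptions: `FRAGMENT` is a guessed fixed-width shape (the real constant also allows an agency token), `keywordRule` and `elapsedMs` are names invented here, and the 1000 ms budget is an arbitrary, deliberately generous threshold.

```typescript
// Assumed fixed-width fragment: two digits, one separator, two digits,
// one separator, two digits, optional ".NN". No nested or unbounded groups.
const FRAGMENT = String.raw`\d{2}[ \u00a0]\d{2}[ \u00a0]\d{2}(?:\.\d{2})?`;

// The fragment is interpolated into a larger rule, as the comment describes.
const keywordRule = new RegExp(String.raw`SECTION[ \t]+(${FRAGMENT})`, "i");

// Long near-miss inputs: each one almost matches, many times over, which is
// where catastrophic backtracking would show up if the pattern allowed it.
const adversarialInputs: string[] = [
  "SECTION 12 34 5".repeat(20000),
  "SECTION" + " ".repeat(100000) + "x",
  "section ".repeat(20000),
];

/** Wall-clock time for a single test() call, in milliseconds. */
function elapsedMs(rule: RegExp, input: string): number {
  const start = Date.now();
  rule.test(input);
  return Date.now() - start;
}

const worstMs = Math.max(...adversarialInputs.map((s) => elapsedMs(keywordRule, s)));
console.log(`worst case: ${worstMs} ms`);

if (worstMs >= 1000) {
  throw new Error(`regex took ${worstMs} ms on adversarial input`);
}
```

A wall-clock threshold is a blunt instrument and can be noisy on a loaded CI runner, so a check like this is best treated as a smoke test rather than a proof.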

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a7f86d70-a57d-4ac5-9819-058461614eac

📥 Commits

Reviewing files that changed from the base of the PR and between fb18cf3 and 0d7ecd4.

📒 Files selected for processing (10)

src/lib/infer-section.test.ts
src/lib/infer-section.ts
src/parser/refs/extract.test.ts
src/parser/refs/extract.ts
src/parser/refs/rules.test.ts
src/parser/refs/rules.ts
src/parser/sec/index.test.ts
src/parser/sec/index.ts
src/parser/text/index.test.ts
src/parser/text/index.ts

thewrz added 12 commits June 5, 2026 19:00

docs(specs): design — section-number expansion across all ingest formats

61650c1

docs(plans): section-number expansion implementation plan — 4 sub-MVP PRs, TDD task breakdown

4701aba

feat(lib): section-number module — expanded-shape validator + normalizer

823cc0c

test(lib): pin section-number fragment capture-group + multiline separator contracts

496e2df

docs(adr): ADR-020 expanded section-number shape as opaque normalized string

e3fa3c9

fix(parser): prose section refs capture dotted and agency suffixes — no more base truncation

86be5ec

fix(lib): section inference keeps dotted and agency suffixes — truncation collided distinct sections

33273bf

fix(lib): strip dash separator in inferred inline titles — parity with text parser

10e3544

fix(parser): .txt header extraction keeps suffixed section numbers and their titles

f89c2c4

feat(parser): SEC SCN/SRF section numbers normalize to canonical expanded shape

f156694

test(parser): pin internal SCN whitespace normalization

972d372

docs(parser): correct SCN comment — gates not yet landed

94a156b

thewrz added 3 commits June 6, 2026 12:15

docs(plans): fix markdownlint MD038/MD040 in plan doc

b9f6978

ci: run PR checks for all base branches — stacked sub-MVP PRs need independent CI

fb18cf3

merge: propagate lib-branch CI trigger + docs lint fixes up the stack

0d7ecd4

coderabbitai Bot reviewed Jun 6, 2026

Comment thread src/lib/infer-section.ts

Base automatically changed from feat/section-number-lib to main June 6, 2026 20:46

thewrz merged commit 3052a21 into main Jun 6, 2026
5 checks passed

thewrz deleted the feat/section-number-parsers branch June 6, 2026 20:50

coderabbitai Bot mentioned this pull request Jun 6, 2026

chore(docs): README reflects expanded section-number grammar (#114–#117) #118

Merged

1 task

thewrz merged 15 commits into main from feat/section-number-parsers
